Abstract

The PI was awarded the AFOSR grant "Semi-supervised Discriminative Structured Prediction"
(Grant  No.:  FA9550-10-1-0335).  The  project  was  funded  for  the  period  of  08/01/10  to  07/31/13 
with  the  total  amount  of  $359,320.  This  report  summarizes  the  progress  made  throughout  the 
project  period. 


1  Research  Progress 

The  proposed  research  develops  cutting-edge  machine  learning  techniques  to  improve  the  performance 
of  a  wide  spectrum  of  robust  and  intelligent  classification  tasks.  Structured  prediction,  one  of  four 
major  challenges  in  statistical  machine  learning,  is  a  classification  or  regression  problem  with  non-iid 
data  where  the  prediction  variables  are  typically  interdependent  in  complex  ways  with  dependencies 
encoded  in  a  graphical  model  to  capture  the  sequential,  spatial,  relational  or  recursive  structure  of 
output  variables.  Semi-supervised  learning,  another  example  of  the  four  major  challenges  in  statistical 
machine  learning,  is  a  technique  which  makes  use  of  both  unlabeled  and  labeled  data  for  training  — 
typically a small amount of labeled data with a large amount of unlabeled data. Traditional approaches
optimize  surrogate  functions  of  performance  measures  for  structured  prediction.  In  this  project,  we 
propose  to  design  novel  machine  learning  algorithms  that  directly  optimize  performance  measures  for 
classification  and  ranking  problems  and  maximize  various  arbitrarily  defined  margins  with  the  goal  to 
improve  the  generalization  performance. 

Consistent  with  the  stated  objectives  of  the  project,  the  project  has  made  considerable  progress 
along  the  following  four  directions.  First,  we  have  proposed  a  boosting  method  that  directly  minimizes 
0-1  loss  and  maximizes  variously  targeted  arbitrarily  defined  margins  for  binary  classification.  Second, 
we  have  developed  a  semi-supervised  boosting  method  that  directly  minimizes  a  combination  of  0-1 
loss  over  labeled  examples  and  soft  0-1  loss  over  unlabeled  examples,  and  maximizes  various  margins 
over  both  labeled  and  unlabeled  examples  where  the  margin  of  an  unlabeled  example  is  defined  to 
be  an  expected  soft  margin.  Third,  we  have  proposed  a  boosting  method  that  directly  minimizes 
0-1  loss  and  maximizes  various  margins  for  multiclass  classification.  Fourth,  we  have  developed  an 
optimization  method  for  linear  models  and  a  boosting  method  that  builds  boosted  trees  to  directly 
maximize  performance  measures  for  ranking. 

The  major  findings  along  the  above  directions  are  described  in  more  detail  in  the  following  four 
subsections. 

1.1  Direct  Boost  for  Binary  Classification 

Let $\mathcal{H} = \{h_1, h_2, \ldots\}$ denote the set of all possible weak classifiers that can be produced by the weak learning algorithm, where a weak classifier $h_j \in \mathcal{H}$ is a mapping from an instance space $\mathcal{X}$ to $\mathcal{Y} = \{-1, 1\}$. The $h_j$'s are not assumed to be linearly independent, and $\mathcal{H}$ is closed under negation, i.e., both $h$ and $-h$ belong to $\mathcal{H}$. We define $C$ of $\mathcal{H}$ as the set of mappings that can be generated by taking a weighted average of classifiers from $\mathcal{H}$:

$$C = \Big\{ f : x \mapsto \sum_{h \in \mathcal{H}} \alpha_h h(x) \;\Big|\; \alpha_h \ge 0 \Big\} \qquad (1)$$


Given a set of training data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ independently drawn from an unknown but fixed probability distribution $p(X, Y)$, we consider finding $f \in C$ that minimizes the empirical classification error in (2) and has good generalization performance.

$$\mathrm{error}(f, \mathcal{D}) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(\hat{y}_i \ne y_i) \qquad (2)$$

where $\hat{y}_i = \arg\max_{y \in \mathcal{Y}} y f(x_i)$ and $\mathbf{1}(\cdot)$ is the classification error function, i.e., an indicator function. Due to the nonconvexity, nondifferentiability and discontinuity of the classification error function and the max operation for $\hat{y}_i$, direct minimization of (2) seems impossible. In the following, we describe novel methods that directly minimize (2) and maximize margins.

1.1.1  Minimizing  0-1  Loss 

DirectBoost, the method we propose, sequentially runs an iterative greedy coordinate descent algorithm that at each step directly minimizes the true classification error (2) instead of the weighted classification error used in AdaBoost [6]. Consider the $t$-th iteration: the ensemble classifier is $f_t(x) = \sum_{k=1}^{t} \alpha_k h_k(x)$, where the previous $t-1$ weak classifiers $h_k(x)$ and corresponding weights $\alpha_k$, $k = 1, \ldots, t-1$, have been selected and determined. Denote $a(x_i) = \sum_{k=1}^{t-1} \alpha_k h_k(x_i)$; then the inference function for sample $x_i$ can be written as

$$F_t(x_i, y) = y\, h_t(x_i)\, \alpha_t + y\, a(x_i) \qquad (3)$$

We now describe the greedy coordinate descent algorithm that sequentially minimizes the 0-1 loss; see [25] for details. Since $\mathcal{H}$ is closed under negation, we only need to consider positive values of $\alpha_t$. We first sort $|a(x_i)|$, $i = 1, \ldots, n$, in increasing order. Then, for a weak learner, we visit the samples in order of increasing $|a(x_i)|$ and compute the slope and the intercept of $F_t(x_i, y_i) = y_i h_t(x_i) \alpha_t + y_i a(x_i)$. Let $e_i = |a(x_i)|$. If the slope is positive and the intercept is positive, the sample margin is positive for all $\alpha_t > 0$, thus there is no error update on the right-hand side of $e_i$; if the slope is positive and the intercept is negative, there is an error reduction on the right-hand side of $e_i$; if the slope is negative and the intercept is positive, there is an error increment on the right-hand side of $e_i$; if the slope is negative and the intercept is negative, the sample margin is always negative for $\alpha_t > 0$, thus there is no error update on the right-hand side of $e_i$. We incrementally calculate the classification error on the intervals delimited by the $e_i$'s, and choose the interval with the minimum classification error.

Figure 1: An example of computing the minimum 0-1 loss of a weak learner over 4 samples.

Consider an example with 4 samples. Suppose for a weak learner we have $F_t(x_i, y_i)$, $i = 1, 2, 3, 4$, as shown in Figure 1. At $\alpha_t = 0$, samples $x_1$ and $x_2$ have negative margins and are therefore misclassified; the error is 2 and the error rate is 50%. We incrementally update the classification error on the intervals delimited by $e_i$, $i = 1, 2, 3, 4$. For $F_t(x_1, y_1)$, the slope is negative and the intercept is negative, so sample $x_1$ always has a negative margin for $\alpha_t > 0$, and there is no error update on the right-hand side of $e_1$. For $F_t(x_2, y_2)$, the slope is positive and the intercept is negative, so when $\alpha_t$ is to the right of $e_2$, sample $x_2$ has a positive margin and becomes correctly classified; we update the error by -1 and the error rate is reduced to 25%. For $F_t(x_3, y_3)$, the slope is negative and the intercept is positive, so when $\alpha_t$ is to the right of $e_3$, sample $x_3$ has a negative margin and becomes misclassified; we update the error by +1 and the error rate returns to 50%. For $F_t(x_4, y_4)$, the slope is positive and the intercept is positive, so sample $x_4$ always has a positive margin for $\alpha_t > 0$, and there is no error update on the right-hand side of $e_4$. We finally obtain the minimum error rate of 25% on the interval $[e_2, e_3]$.
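The sweep described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation from [25]: the function name `min_zero_one_interval` and its argument layout are hypothetical, samples with a zero intercept are ignored for simplicity, and jumps that share the same critical point are merged before the sweep.

```python
from collections import defaultdict

def min_zero_one_interval(a, h, y):
    """Return (min_error_count, (lo, hi)) over alpha_t >= 0.

    a[i]: score a(x_i) of the previous ensemble (assumed nonzero),
    h[i]: candidate weak classifier output h_t(x_i) in {-1, +1},
    y[i]: label in {-1, +1}.
    """
    n = len(a)
    # Errors at alpha_t = 0: samples whose intercept y_i * a(x_i) is negative.
    error = sum(1 for i in range(n) if y[i] * a[i] < 0)
    jumps = defaultdict(int)  # critical point e_i -> change in error count
    for i in range(n):
        slope, intercept = y[i] * h[i], y[i] * a[i]
        if slope > 0 and intercept < 0:
            jumps[abs(a[i])] -= 1  # becomes correctly classified right of e_i
        elif slope < 0 and intercept > 0:
            jumps[abs(a[i])] += 1  # becomes misclassified right of e_i
    points = sorted(jumps)
    bounds = points + [float("inf")]
    best = (error, (0.0, bounds[0]))
    for j, e in enumerate(points):
        error += jumps[e]
        if error < best[0]:
            best = (error, (e, bounds[j + 1]))
    return best

# Four samples mirroring the four slope/intercept cases of the worked example.
a = [-0.5, -1.0, 2.0, 3.0]
h = [-1, 1, -1, 1]
y = [1, 1, 1, 1]
result = min_zero_one_interval(a, h, y)
```

For these four hypothetical samples the error count starts at 2, drops to 1 between the second and third critical points, and returns to 2 afterwards, matching the 50% / 25% / 50% pattern of the worked example.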

We pick the weak learners that each have an interval with the largest classification error reduction. Since the classification error is flat on the interval with the minimum classification error, we determine the optimal weight of each selected weak learner by minimizing the exponential loss within the corresponding interval. We only add the weak learner with the smallest exponential loss into the ensemble classifier. We repeat this procedure until the training error reaches its minimum, which is 0 in the data separable case. We then go to the next stage, explained below, which aims to maximize margins. A nice property of the above greedy coordinate descent algorithm is that the classification error is monotonically decreasing. Its computational complexity is $O(tMn)$, where $M$ is the number of weak learners considered by the weak learning algorithm at each round, which is identical to that of AdaBoost.
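As a hedged illustration of the weight selection step: for weak classifiers with outputs in {-1, +1}, the exponential loss along $\alpha_t$ has the form $A e^{-\alpha_t} + B e^{\alpha_t}$, which is convex, so its minimizer restricted to an interval is the unconstrained minimizer $\tfrac{1}{2}\ln(A/B)$ clipped to that interval. The helper name `exp_loss_weight` is hypothetical, and the sketch assumes a finite interval; a real implementation would likely stay strictly inside the interval so that the 0-1 loss does not change at the endpoint.

```python
import math

def exp_loss_weight(a, h, y, lo, hi):
    """Minimize sum_i exp(-y_i * (a_i + alpha * h_i)) over alpha in [lo, hi].

    Assumes h[i] and y[i] are in {-1, +1} and that lo <= hi are finite.
    """
    # A collects samples the weak learner gets right, B those it gets wrong.
    A = sum(math.exp(-y[i] * a[i]) for i in range(len(a)) if y[i] * h[i] > 0)
    B = sum(math.exp(-y[i] * a[i]) for i in range(len(a)) if y[i] * h[i] < 0)
    if B == 0.0:
        return hi  # loss is non-increasing in alpha
    if A == 0.0:
        return lo  # loss is non-decreasing in alpha
    alpha = 0.5 * math.log(A / B)  # unconstrained minimizer of the convex loss
    return min(max(alpha, lo), hi)

a = [-0.5, -1.0, 2.0, 3.0]
h = [-1, 1, -1, 1]
y = [1, 1, 1, 1]
clipped = exp_loss_weight(a, h, y, 1.0, 2.0)
free = exp_loss_weight(a, h, y, 0.0, 5.0)
```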


1.1.2  Maximizing  Margins 


The margins theory [15] provides an insightful analysis of the success of AdaBoost: the authors proved that the generalization error of any ensemble classifier is bounded in terms of the entire distribution of margins of training examples, as well as the number of training examples and the complexity of the base classifiers, and that AdaBoost's dynamics have a strong tendency to increase the margins of training examples. This view motivates us to prove that the generalization error of any ensemble classifier is bounded in terms of the statistics of margins of training examples, as well as the number of training examples and the complexity of the base classifiers, and to propose a coordinate ascent algorithm that directly maximizes several types of margins right after the training error reaches a (local) minimum.


Margin

The margin of a labeled example $(x_i, y_i)$ with respect to an ensemble classifier $f_t(x) = \sum_{k=1}^{t} \alpha_k h_k(x)$ is defined to be

$$m_i = \frac{y_i \sum_{k=1}^{t} \alpha_k h_k(x_i)}{\sum_{k=1}^{t} \alpha_k} \qquad (4)$$

We denote $a_i = \sum_{k=1}^{t-1} y_i \alpha_k h_k(x_i)$, $b_{i,t} = y_i h_t(x_i) \in \{-1, +1\}$ and $c = \sum_{k=1}^{t-1} \alpha_k$; then the margin on the $i$-th example $(x_i, y_i)$ can be rewritten as $m_i = \frac{a_i + b_{i,t}\alpha_t}{c + \alpha_t}$. The derivative of the margin on the $i$-th example with respect to $\alpha_t$ is calculated as

$$\frac{\partial m_i}{\partial \alpha_t} = \frac{b_{i,t}\, c - a_i}{(c + \alpha_t)^2}.$$

Since $c \ge |a_i|$, depending on the sign of $b_{i,t}$, the derivative of the margin on the $i$-th sample $(x_i, y_i)$ is either positive or negative, independent of the value of $\alpha_t$. This is also true for the second derivative of the margin. Therefore, the margin on the $i$-th example $(x_i, y_i)$ with respect to $\alpha_t$ is either concave when it is monotonically increasing or convex when it is monotonically decreasing. See Figure 2 for a simple illustration.
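The monotonicity and curvature claims can be checked with a short worked derivation. This is a sketch using the quantities defined above ($a_i$, $b_{i,t}$, $c$, and the margin written as $m_i$), and it assumes $c + \alpha_t > 0$:

```latex
\frac{\partial m_i}{\partial \alpha_t}
  = \frac{b_{i,t}\,(c + \alpha_t) - (a_i + b_{i,t}\,\alpha_t)}{(c + \alpha_t)^2}
  = \frac{b_{i,t}\,c - a_i}{(c + \alpha_t)^2},
\qquad
\frac{\partial^2 m_i}{\partial \alpha_t^2}
  = -\,\frac{2\,(b_{i,t}\,c - a_i)}{(c + \alpha_t)^3}.
```

Both numerators depend on $\alpha_t$ only through the positive factor $c + \alpha_t$, so neither derivative changes sign, and the second derivative always has the opposite sign to the first. An increasing margin curve is therefore concave and a decreasing one is convex.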

Consider a greedy coordinate ascent algorithm that maximizes the average margin over the $n'$ worst training examples, $m_{\mathrm{average}\, n'}$. Maximizing the minimum margin is the special case obtained by choosing $n' = 1$.


Figure 2: Margin curves of six examples. At points $P_1$, $P_2$, $P_3$ and $P_4$, the median example changes. At points $P_2$ and $P_4$, the set of bottom $n' = 3$ examples changes.

Figure 2 is a simple illustration with six training examples. Our aim is to maximize the average margin over the bottom 3 examples. The interval $[0, d]$ of $\alpha_t$ indicates an interval where the training error is zero. At the point $d$, the sample $S_4$ changes its margin from positive to negative, which causes the training error to jump from 0 to 1/6. As shown in Figure 2, the margin of each of the six training examples is either monotonically increasing or monotonically decreasing.
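To make the objective concrete, the following sketch evaluates the average margin over the bottom $n'$ examples at a given candidate weight. It only evaluates the objective at one point and is not the efficient search of [25]; the function name and argument layout are hypothetical.

```python
def bottom_average_margin(a, b, c, alpha, n_prime):
    """Average of the n_prime smallest margins (a_i + b_i * alpha) / (c + alpha).

    a[i]: sum_k y_i * alpha_k * h_k(x_i) over the previous t-1 rounds,
    b[i]: y_i * h_t(x_i) in {-1, +1},
    c:    sum of the previous t-1 weights (c + alpha must be positive).
    """
    margins = sorted((a[i] + b[i] * alpha) / (c + alpha) for i in range(len(a)))
    return sum(margins[:n_prime]) / n_prime

a = [0.5, -0.2, 0.8]
b = [1, -1, 1]
min_margin = bottom_average_margin(a, b, c=1.0, alpha=1.0, n_prime=1)
avg_two = bottom_average_margin(a, b, c=1.0, alpha=1.0, n_prime=2)
```

With `n_prime=1` the function returns the minimum margin, which is the special case mentioned above.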

We have designed an efficient greedy coordinate ascent algorithm that sequentially maximizes the average margin of the bottom $n'$ examples; see [25] for details. We add the weak learner that yields the largest increment of the average margin over the bottom $n'$ examples into the ensemble classifier. This procedure terminates when no weak learner yields an increment in the average margin over the bottom $n'$ examples.

ε-Relaxation: Unfortunately, there is a fundamental difficulty in the greedy coordinate ascent algorithm that maximizes the average margin of the bottom $n'$ samples: it gets stuck at a corner, a coordinatewise maximum solution that is not an optimal solution, from which it is impossible to make progress along any coordinate direction. We propose an ε-relaxation method [2] to overcome this difficulty. The main idea is to allow a single coordinate to change even if this worsens the margin function. When a coordinate is changed, however, it is set to ε plus or ε minus the value that maximizes the margin function along that coordinate, where ε is a positive number. If ε is small enough, the algorithm can eventually approach a small neighborhood of the optimal solution.

We have also designed a similar greedy coordinate ascent algorithm to directly maximize the bottom $n'$-th sample margin.

1.1.3  Experimental  Results 

We evaluate the performance of DirectBoost on 10 UCI data sets and compare it with those of AdaBoost [6], LogitBoost [9], LPBoost with column generation [5] and BrownBoost [7]. For all the algorithms in our comparison, we use decision trees with a depth of either 1 or 3 as weak learners, since for the small datasets decision stumps (tree depth of 1) are already strong enough. DirectBoost with decision trees is implemented by a greedy top-down recursive partition algorithm to find the tree, but differently from AdaBoost and LPBoost, since DirectBoost does not maintain a distribution over training samples. Instead, for each splitting node, DirectBoost simply chooses the attribute to split on by minimizing 0-1 loss or maximizing the predefined margin value. In all the experiments where ε-relaxation is used, the value of ε is 0.01.


| Datasets | N | D | depth | AdaBoost | LogitBoost | LPBoost | BrownBoost | DirectBoost_avg | DirectBoost^ε_avg | DirectBoost_order |
|---|---|---|---|---|---|---|---|---|---|---|
| Tic-tac-toe | 958 | 9 | 3 | 1.47(0.7) | 1.47(1.0) | 2.62(0.8) | 3.66(1.3) | 0.63(0.4) | 1.15(0.8) | 1.05(0.4) |
| Diabetes | 768 | 8 | 3 | 27.71(1.7) | 27.32(1.3) | 26.01(3.3) | 26.67(2.6) | 25.62(2.5) | 25.49(3.0) | 23.4(3.7) |
| Australian | 690 | 14 | 3 | 14.2(1.8) | 16.23(2.6) | 14.49(4.4) | 13.77(4.6) | 14.06(3.6) | 13.33(3.0) | 13.48(2.9) |
| Fourclass | 862 | 2 | 3 | 1.86(1.3) | 2.44(1.6) | 3.02(2.3) | 2.33(1.7) | 2.33(1.0) | 1.86(1.3) | 1.74(1.5) |
| Ionosphere | 351 | 34 | 3 | 9.71(3.7) | 9.71(3.1) | 8.57(2.7) | 10.86(2.8) | 7.71(3.0) | 8.29(2.7) | 7.71(4.4) |
| Splice | 1000 | 61 | 3 | 5.3(1.4) | 5.3(2.6) | 4.8(1.4) | 6.1(1.1) | 4.8(0.7) | 4.0(0.5) | 6.7(1.6) |
| Cancer-wdbc | 569 | 29 | 1 | 4.25(2.5) | 4.42(1.4) | 3.89(1.5) | 4.25(2.2) | 4.96(3.0) | 4.07(2.0) | 3.72(2.9) |
| Cancer-wpbc | 198 | 32 | 1 | 27.69(7.6) | 30.26(7.3) | 26.15(10.5) | 28.72(8.4) | 27.69(8.1) | 24.62(7.6) | 27.18(10.0) |
| Heart | 270 | 13 | 1 | 17.41(7.7) | 18.52(5.1) | 19.26(8.1) | 18.15(7.2) | 18.15(5.1) | 16.67(7.5) | 18.15(7.6) |
| Adult | 6414 | 14 | 3 | 15.6(0.7) | 15.39(0.8) | 16.2(1.1) | 15.56(0.9) | 16.25(1.7) | 15.28(0.8) | 15.8(1.1) |

Table 1: Percent test errors (standard deviations in parentheses) of AdaBoost, LogitBoost, soft margin LPBoost with column generation, BrownBoost, and three DirectBoost methods on 10 UCI datasets, each with N samples and D variables.


We partition each UCI dataset into five parts with the same number of samples for five-fold cross validation. In each fold, we use three parts for training, one part for validation, and the remaining part for testing. The validation set is used to choose the optimal model for each algorithm. For AdaBoost and LogitBoost, the validation data is used to perform early stopping since there is no natural stopping criterion for these algorithms: we run the algorithms until convergence, where the stopping criterion is that the change of loss is less than 1e-6, and then choose the ensemble classifier from the round with minimum error on the validation data. For BrownBoost, we select the optimal cutoff parameter by the validation set; the cutoff parameters are chosen from {0.0001, 0.001, 0.01, 0.03, 0.05, 0.08, 0.1, 0.14, 0.17, 0.2}. LPBoost maximizes the soft margin subject to linear constraints, and its objective is equivalent to the average margin of the bottom $n'$ samples [19]; thus we set the same candidate parameters $n'/n = \{0.01, 0.05, 0.1, 0.2, 0.5, 0.8\}$ for LPBoost and DirectBoost. For LPBoost, the termination rule we use is the same as the one in [5], and we select the optimal regularization parameter by the validation set. For DirectBoost, the algorithm terminates when there is no increment in the targeted margin value, and we select the model with the optimal $n'$ by the validation set.
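The three/one/one partition scheme can be sketched as follows. The report does not say how the parts are assigned to roles in each fold, so the rotation used here (part k for testing, part k+1 for validation, the rest for training) is an assumption, as is the interleaved assignment of indices to parts; the function name is hypothetical and the indices are assumed to be shuffled beforehand.

```python
def five_fold_splits(n_samples, n_parts=5):
    """Yield (train, validation, test) index lists, one triple per fold.

    Assumption: in fold k, part k is the test set, part (k + 1) mod n_parts
    is the validation set, and the remaining parts form the training set.
    """
    # Interleaved assignment keeps part sizes within one sample of each other.
    parts = [list(range(k, n_samples, n_parts)) for k in range(n_parts)]
    for k in range(n_parts):
        val_k = (k + 1) % n_parts
        train = [i for j in range(n_parts) if j not in (k, val_k) for i in parts[j]]
        yield train, parts[val_k], parts[k]

folds = list(five_fold_splits(10))
```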

We use DirectBoost_avg to denote our method that runs Algorithm 1 first and then maximizes the average of the bottom $n'$ margins without ε-relaxation, DirectBoost^ε_avg to denote our method that runs Algorithm 1 first and then maximizes the average margin of the bottom $n'$ samples with ε-relaxation, and DirectBoost_order to denote our method that runs Algorithm 1 first and then maximizes the bottom $n'$-th margin with ε-relaxation. The means and standard deviations of test errors are given in Table 1. DirectBoost_avg, DirectBoost^ε_avg and DirectBoost_order outperform the other boosting algorithms in general; in particular, DirectBoost^ε_avg is consistently better than AdaBoost, LogitBoost, LPBoost and BrownBoost over all data sets except Cancer-wdbc. Among the family of DirectBoost algorithms, DirectBoost_avg wins on two datasets, where it searches for the optimal margin solution in the region of zero training error; this means that keeping the training error at zero may lead to good performance in some cases. DirectBoost_order wins on three other datasets, but its results are unstable and sensitive to $n'$. With ε-relaxation, DirectBoost^ε_avg searches for the optimal margin solution in the whole parameter space and gives the best performance on the remaining 5 data sets. It is well known that AdaBoost performs well on datasets with a small test error, such as Tic-tac-toe and Fourclass, and that it is extremely hard for other boosting algorithms to beat AdaBoost there. Nevertheless, DirectBoost is still able to give even better results in this case. For example, on the Tic-tac-toe data set, the test error becomes 0.63%, an error rate reduction of more than half. Our method would be more valuable for those who value prediction accuracy, which might be the case in areas of medical and genetic research.

Table 2 shows the number of iterations and total run times (in seconds) for AdaBoost, LPBoost and DirectBoost^ε_avg at the training stage, where we use the Adult dataset with 10000 training samples. All three algorithms employ decision trees with a depth of 3 as weak learners. The experiments are conducted on a PC with a Core2 Duo 2.6GHz CPU and 2G RAM. Clearly DirectBoost^ε_avg takes less time for the entire training stage since it converges much faster. LPBoost converges in fewer than three hundred rounds, but as a totally corrective algorithm, it has a greater computational cost on each round. To handle large scale data sets in practice, similar to AdaBoost, we can use many tricks. For example, we can partition the data into many parts and use distributed algorithms to select the weak learner.

| | # of iterations | Total run time (s) |
|---|---|---|
| AdaBoost | 117852 | 31168 |
| LPBoost | 286 | 167520 |
| DirectBoost^ε_avg | 1737 | 606 |

Table 2: Number of iterations and total run times (in seconds) in the training stage on the Adult dataset with 10000 training samples; the depth of the decision trees is 3.

We have also conducted experiments to evaluate noise robustness; we find that DirectBoost_order has an impressive noise tolerance property.

Please see [25] for more technical detail and experimental results.

1.2 Direct Boost for Semi-supervised Classification

Consider semi-supervised binary classification. Assume we are given not only a set of $n$ labeled examples, $\mathcal{D}^l = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, but also a set of $m$ unlabeled examples, $\mathcal{D}^u = \{x_{n+1}, \ldots, x_{n+m}\}$. Just as in the supervised case for boosting, the goal here is to use the combined set of labeled and unlabeled examples $\mathcal{D}^l \cup \mathcal{D}^u$ to construct an ensemble classifier $f \in C$ that minimizes the following 0-1 loss and has good generalization performance:

$$\mathrm{error}(f, \mathcal{D}^l \cup \mathcal{D}^u) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}(y_i f(x_i) < 0) + \frac{\gamma}{m} \sum_{i=n+1}^{n+m} \sum_{y \in \mathcal{Y}} p(y|x_i)\, \mathbf{1}(y f(x_i) < 0) \qquad (5)$$

Here the first term denotes the classification error for labeled data, and the second term represents the soft classification error for unlabeled data, $p(y|x_i) = \frac{1}{1 + e^{-y f(x_i)}}$, and $\gamma$ is a trade-off parameter that controls the influence of the unlabeled data. The minimum entropy and variance semi-supervised boosting methods [26, 29] optimize surrogates (log-loss and negative sigmoid function) of (5).
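As a small illustration, the combined loss (5) can be evaluated directly once the ensemble scores and the class probability estimates for the unlabeled examples are available. The sketch below assumes the labeled term is averaged over $n$ and the unlabeled term is scaled by $\gamma/m$; the function name and the choice to pass $p(+1|x_i)$ as an explicit list are hypothetical conveniences, not part of the report.

```python
def combined_zero_one_loss(f_lab, y_lab, f_unl, p_pos_unl, gamma):
    """Evaluate the combined 0-1 loss for given ensemble scores.

    f_lab, y_lab: scores f(x_i) and labels in {-1, +1} of labeled examples,
    f_unl:        scores f(x_i) of unlabeled examples,
    p_pos_unl:    estimates of p(y = +1 | x_i) for unlabeled examples,
    gamma:        trade-off weight of the unlabeled term.
    """
    n, m = len(f_lab), len(f_unl)
    labeled = sum(1 for f, y in zip(f_lab, y_lab) if y * f < 0) / n
    soft = 0.0
    for f, p_pos in zip(f_unl, p_pos_unl):
        if f < 0:
            soft += p_pos        # label y = +1 is the one with y * f < 0
        elif f > 0:
            soft += 1.0 - p_pos  # label y = -1 is the one with y * f < 0
    return labeled + gamma * soft / m

loss = combined_zero_one_loss([1.0, -2.0], [1, 1], [0.5, -1.0], [0.8, 0.3], gamma=0.1)
```

In this toy call one of two labeled examples is misclassified (0.5), and the two unlabeled examples contribute 0.2 and 0.3 before scaling by 0.1/2.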

1.2.1  Minimizing  0-1  Loss 

In the semi-supervised case, DirectBoost first runs the algorithm in Section 1.1.1, which minimizes the 0-1 loss over the labeled data $\mathcal{D}^l$, to construct a good initial ensemble classifier. We then run an iterative greedy coordinate descent algorithm that directly minimizes (5), and estimate $p(y|x_i)$ through an iterative scheme. That is, given an estimate $p_0(y|x_i)$, (5) yields an ensemble classifier $f_1(x_i)$, which leads to a new estimate $p_1(y|x_i)$ through Algorithm 1 below. The estimate $p_1(y|x_i)$ is expected to be more accurate than $p_0(y|x_i)$ for $p(y|x_i)$ because additional information from labeled and unlabeled data has been used in constructing $f_1(x_i)$ through $p_0(y|x_i)$. Consider the $t$-th iteration: the ensemble classifier is $f_t(x) = \sum_{k=1}^{t} \alpha_k h_k(x)$, where the previous $t-1$ weak classifiers $h_k(x)$ and corresponding weights $\alpha_k$, $k = 1, \ldots, t-1$, have been selected and determined. We estimate $p(y|x_i)$ by the logistic function of the ensemble classifier of the previous step, $p(y|x_i) = \frac{1}{1 + \exp(-y f_{t-1}(x_i))}$. Then the combined 0-1 loss (5) becomes a stepwise function of $\alpha_t$.

Denote $a(x_i) = \sum_{k=1}^{t-1} \alpha_k h_k(x_i)$; then the inference function for sample $x_i$ can be written as

$$F_t(x_i, y) = y\, h_t(x_i)\, \alpha_t + y\, a(x_i) \qquad (6)$$

which is a linear function of $\alpha_t$ with slope $y h_t(x_i)$ and intercept $y a(x_i)$. For a labeled example $(x_i, y_i) \in \mathcal{D}^l$, $F_t(x_i, y_i) > 0$ means that this example is correctly classified; otherwise, it is misclassified. These two states switch at the point $\alpha_t = -\frac{a(x_i)}{h_t(x_i)}$, which is denoted as the critical point $e_i$; the value of the combined 0-1 loss (5) changes by $\frac{1}{n}$ at $e_i$. For example, in Figure 3, $F_t(x_1, y_1)$ changes its sign from negative to positive at $e_1$, so $(x_1, y_1) \in \mathcal{D}^l$ is correctly classified for $\alpha_t > e_1$ and (5) has a $\frac{1}{n}$ reduction on the right side of $e_1$. $F_t(x_2, y_2)$ changes its sign from positive to negative at $e_2$, so $(x_2, y_2) \in \mathcal{D}^l$ becomes misclassified for $\alpha_t > e_2$ and (5) has a $\frac{1}{n}$ increment on the right side of $e_2$.

To compute the error of unlabeled examples, we use $\hat{y}_i = \mathrm{sign}(a(x_i))$ to denote the pseudo label of an unlabeled example $x_i \in \mathcal{D}^u$. Then, similarly, the sign of $F_t(x_i, \hat{y}_i)$ indicates whether $x_i \in \mathcal{D}^u$ is "correctly classified" or "misclassified". Again, the critical point for $x_i \in \mathcal{D}^u$ is $\alpha_t = e_i = -\frac{a(x_i)}{h_t(x_i)}$, and the value of the combined 0-1 loss (5) changes by $\frac{\gamma}{m} |p(+1|x_i) - p(-1|x_i)|$ at $e_i$. In Figure 3, $F_t(x_3, \hat{y}_3)$ changes its sign from positive to negative at $e_3$, and (5) has a $\frac{\gamma}{m} |p(+1|x_3) - p(-1|x_3)|$ increment on the right side of $e_3$. Note that the intercept is always positive for an unlabeled example $x_i \in \mathcal{D}^u$. The critical points $e_i = -\frac{a(x_i)}{h_t(x_i)}$, $i = 1, \ldots, n + m$, divide the range of $\alpha_t$ into (at most) $n + m + 1$ intervals, and on each interval the combined 0-1 loss (5) takes a constant value.

Thus we can design a greedy coordinate descent algorithm that sequentially minimizes (5) [27].

Figure 3: An example of computing the minimum combined 0-1 loss for a weak learner $h_t$ over 2 labeled samples $(x_1, y_1)$ and $(x_2, y_2)$, and 1 unlabeled sample $x_3$.
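The jumps of the combined loss at the critical points can be collected as sketched below, after which the piecewise-constant loss can be swept from left to right. This is a hedged illustration rather than the algorithm of [27]: the function name is hypothetical, the jump sizes $1/n$ and $(\gamma/m)\,|2p(+1|x_i) - 1|$ follow the normalization assumed for (5), and scores $a(x_i)$ are assumed to be nonzero.

```python
from collections import defaultdict

def combined_loss_jumps(a_lab, h_lab, y_lab, a_unl, h_unl, p_pos_unl, gamma):
    """Return sorted (critical_point, loss_change) pairs for alpha_t > 0."""
    n, m = len(a_lab), len(a_unl)
    jumps = defaultdict(float)
    for a, h, y in zip(a_lab, h_lab, y_lab):
        slope, intercept = y * h, y * a
        if slope > 0 and intercept < 0:
            jumps[abs(a)] -= 1.0 / n  # labeled example becomes correct
        elif slope < 0 and intercept > 0:
            jumps[abs(a)] += 1.0 / n  # labeled example becomes wrong
    for a, h, p_pos in zip(a_unl, h_unl, p_pos_unl):
        pseudo = 1 if a > 0 else -1   # pseudo label sign(a(x_i))
        # The intercept pseudo * a is positive, so only a negative slope
        # can flip the sign, which increases the soft error.
        if pseudo * h < 0:
            jumps[abs(a)] += gamma / m * abs(2.0 * p_pos - 1.0)
    return sorted(jumps.items())

# Two labeled samples and one unlabeled sample, in the spirit of Figure 3.
events = combined_loss_jumps(
    a_lab=[-1.0, 2.0], h_lab=[1, -1], y_lab=[1, 1],
    a_unl=[3.0], h_unl=[-1], p_pos_unl=[0.9], gamma=0.1,
)
```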


1.2.2  Maximizing  Margins 

Similar to the supervised case, we now describe the algorithm to directly maximize various margins over both labeled and unlabeled examples. The margin of a labeled example $(x_i, y_i) \in \mathcal{D}^l$ w.r.t. an ensemble classifier $f_t(x_i)$ is defined to be

$$\varphi_i^l = \frac{y_i f_t(x_i)}{\sum_{k=1}^{t} \alpha_k} \qquad (7)$$

where $\varphi_i^l$ can be interpreted as a measure of how confidently this labeled example is correctly classified. For an unlabeled example $x_i \in \mathcal{D}^u$, we define its margin w.r.t. an ensemble classifier $f_t(x)$ as

$$\varphi_i^u = \sum_{y \in \mathcal{Y}} p(y|x_i)\, \frac{y f_t(x_i)}{\sum_{k=1}^{t} \alpha_k} = \big(2 p(y = 1|x_i) - 1\big) \frac{f_t(x_i)}{\sum_{k=1}^{t} \alpha_k} \qquad (8)$$

We can sort $\varphi_i^l$ and $\varphi_i^u$ in increasing order respectively, and consider the $n'$ worst labeled examples ($n' \le n$) and the $m'$ worst unlabeled examples ($m' \le m$) that have the smallest margins; then the combined average margin over those examples is

$$\varphi_{\mathrm{avg}(n', m')} = \frac{1}{n'} \sum_{i \in B^l_{n'}} \varphi_i^l + \frac{\gamma}{m'} \sum_{i \in B^u_{m'}} \varphi_i^u \qquad (9)$$

where $B^l_{n'}$ denotes the set of $n'$ labeled examples having the smallest margins, $B^u_{m'}$ denotes the set of $m'$ unlabeled examples having the smallest margins, and again $\gamma$ is a trade-off parameter that controls the influence of the unlabeled data. The parameter $n'$ indicates how much we relax the hard margin on labeled examples, and we set $n'$ based on knowledge of the number of noisy examples in $\mathcal{D}^l$ [?]. The higher the noise rate, the larger the $n'$ that should be used. The parameter $m'$ controls the relaxation of the margin distribution over the unlabeled data. A smaller $m'$ makes the algorithm focus more on the unlabeled examples close to the decision boundary. In this section, we consider (9) as our objective.
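For concreteness, the unlabeled margin (8) and the combined objective (9) can be evaluated as in the sketch below. The function names are hypothetical, and the weighting $1/n'$ and $\gamma/m'$ of the two sums is an assumption about the exact normalization of (9).

```python
def unlabeled_margin(p_pos, f_value, alpha_sum):
    """Expected soft margin (2 p(+1|x) - 1) * f(x) / sum_k alpha_k."""
    return (2.0 * p_pos - 1.0) * f_value / alpha_sum

def combined_average_margin(margins_lab, margins_unl, n_prime, m_prime, gamma):
    """Average margin over the n_prime worst labeled examples plus
    gamma times the average over the m_prime worst unlabeled examples."""
    worst_lab = sorted(margins_lab)[:n_prime]
    worst_unl = sorted(margins_unl)[:m_prime]
    return sum(worst_lab) / n_prime + gamma * sum(worst_unl) / m_prime

phi_u = unlabeled_margin(p_pos=0.75, f_value=2.0, alpha_sum=4.0)
objective = combined_average_margin(
    [0.4, -0.1, 0.7], [0.2, 0.05, 0.6, 0.3], n_prime=2, m_prime=2, gamma=0.1
)
```

Choosing a small `m_prime` restricts the unlabeled term to the examples with the smallest expected margins, which is the behaviour the text attributes to a smaller $m'$.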

For an unlabeled example $x_i \in \mathcal{D}^u$, if we let $p(y|x_i) = \frac{1}{1 + \exp(-y f_t(x_i))}$, then $\varphi_i^u$ is a complex function of $\alpha_t$. Again, we use $p(y|x_i) = \frac{1}{1 + \exp(-y f_{t-1}(x_i))}$ instead, estimating the conditional probability by the previous step. Denote $\hat{y}_i = 2 p(y = 1|x_i) - 1$; then the estimated margin for $x_i \in \mathcal{D}^u$ is

$$\hat{\varphi}_i^u = \hat{y}_i\, \frac{f_{t-1}(x_i) + \alpha_t h_t(x_i)}{\sum_{k=1}^{t-1} \alpha_k + \alpha_t} \qquad (10)$$

For a given weak hypothesis, we then find the $\alpha_t$ that maximizes (11) instead:

$$\hat{\varphi}_{\mathrm{avg}(n', m')} = \frac{1}{n'} \sum_{i \in B^l_{n'}} \varphi_i^l + \frac{\gamma}{m'} \sum_{i \in B^u_{m'}} \hat{\varphi}_i^u \qquad (11)$$

Figure 4: An example of computing the $\alpha_t$ that maximizes $\varphi_{\mathrm{avg}(n'=2, m'=2)}$.

It can be shown that (11) is a quasiconcave function for a given weak hypothesis. This property allows us to design an efficient algorithm that maximizes (11).

An iterative greedy coordinate ascent algorithm that sequentially and approximately maximizes (9) can then be designed, where $p(y|x_i)$ is updated at each iteration. We add the weak hypothesis that yields the largest increment of $\varphi_{\mathrm{avg}(n', m')}$ into the ensemble classifier, with the weight that leads to the maximum of (11). The stopping criterion is that if there is no increment in $\varphi_{\mathrm{avg}(n', m')}$ over all weak hypotheses, then the algorithm has reached a stationary point.

Since $\varphi_{\mathrm{avg}(n', m')}$ is non-differentiable, a fundamental difficulty in the greedy coordinate ascent algorithm proposed above is that the algorithm gets stuck at a corner from which it is impossible to make progress along any coordinate direction. To overcome this difficulty, we employ an ε-relaxation method. The main idea is to allow a single coordinate to change even if this worsens the margin function. When a coordinate is changed, however, it is set to ε plus or ε minus the value that maximizes the margin function along that coordinate, where ε is a positive number.

1.2.3 Experimental Results

We compare the performance of SSDirectBoost with that of AdaBoost, DirectBoost, ASSEMBLE [1],
and EntropyBoost [29] on UCI datasets. Since we concentrate on the case where labeled data are
limited while unlabeled data are plentiful during training, we randomly select a small portion of
the data as labeled training examples; the remaining examples are used as unlabeled training data,
validation data, and test data. The dimension of the data and the numbers of labeled (L),
unlabeled (U), validation (V), and test (T) examples for each dataset are given in Table 3. We use
the validation data to choose the optimal model for each algorithm. For AdaBoost, the validation
data is used to perform early stopping: we run AdaBoost until convergence, where the stopping
criterion is that the change in loss is less than 1e-6, and then choose the ensemble classifier
from the round with minimum error on the validation data. Early stopping is also applied to
ASSEMBLE and EntropyBoost, since they have no natural stopping criterion. Moreover, for ASSEMBLE
and EntropyBoost, the tradeoff parameters that control the influence of unlabeled data are chosen
on the validation set from the values {1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001}. For DirectBoost, the
parameter $n'$ is chosen on the validation data from the values {1, n/5, n/3, n/2}. For
SSDirectBoost, $n'$ is chosen from {1, n/5, n/3, n/2} and $m'$ from {1, m/3, m/2} on the validation
set, and $\gamma$ is set to 0.1. The stopping criterion of SSDirectBoost is defined in line 12 of
Algorithm 2; because SSDirectBoost terminates at the margin maximization solution, we need not
apply early stopping on the validation data.
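
The model-selection step used for the early-stopped baselines can be sketched as follows. This is a minimal illustration, assuming the validation error after each boosting round has already been recorded; the function name `select_round` and the tie-breaking rule are hypothetical and not taken from the report.

```python
def select_round(val_errors):
    """Return the 1-based boosting round with minimum validation error.

    Ties are broken in favor of the earliest round (an assumption; the
    report does not specify a tie-breaking rule).
    """
    if not val_errors:
        raise ValueError("val_errors must be non-empty")
    best = min(range(len(val_errors)), key=lambda t: val_errors[t])
    return best + 1
```

The ensemble is then truncated to the first `select_round(val_errors)` weak classifiers.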


Data        Dim.  No. of examples        Depth  AdaBoost    DirectBoost  ASSEMBLE    EntropyBoost  SSDirectBoost
                  L    U     V    T
Mushroom    22    20   1000  50   7054   1      8.81(1.9)   5.38(1.8)    5.05(0.7)   5.1(1.6)      2.2(0.5)
Adult       14    50   1000  50   47742  1      20.33(2.0)  20.14(1.9)   19.77(1.8)  20.53(2.2)    19.9(2.0)
Australian  14    50   300   40   300    1      15.67(1.1)  15.0(0.5)    14.73(1.1)  14.73(0.9)    13.67(0.9)
Liver       6     30   200   115  200    1      41.5(4.3)   41.1(5.9)    36.9(5.7)   37.2(5.2)     36.3(5.3)
Sonar       60    20   100   88   100    1      33.8(3.8)   33.6(4.7)    31.8(5.1)   35.4(4.4)     28.0(2.5)
Kr-vs-Kp    36    50   1000  50   2096   1      10.46(2.2)  8.66(2.4)    8.2(2.2)    8.5(2.1)      7.65(2.0)
Cod-Rna     7     50   1000  50   58435  3      17.33(2.4)  16.71(2.6)   18.87(2.7)  19.6(3.0)     14.44(1.8)
Splice      61    100  400   100  400    3      13.32(1.0)  12.96(2.2)   14.12(2.0)  14.04(1.1)    10.72(2.1)
Magic       100   100  1000  100  17820  3      20.6(2.0)   19.72(1.3)   19.87(1.3)  20.8(1.2)     19.51(1.1)
Spambase    57    100  1000  100  3401   3      10.2(1.5)   11.0(1.0)    10.55(0.8)  10.45(1.0)    8.96(0.9)

Table  3:  Mean  error  rates  (in  %)  and  standard  deviation  of  each  boosting  method  on  UCI  datasets  when  decision  trees 
(with  depth  of  1  or  3)  are  used  as  weak  learners. 

Table 3 shows the results of the different boosting methods when decision trees (with depth of 1
or 3) are used as weak learners. As expected, the semi-supervised boosting algorithms generally
outperform the supervised methods, which indicates that the unlabeled data does help to improve
generalization performance. Furthermore, the proposed SSDirectBoost generally outperforms the
gradient-based semi-supervised boosting methods by directly maximizing the margin objective
function on the labeled and unlabeled sets $\mathcal{D}^l \cup \mathcal{D}^u$. When decision trees
with a depth of three are used, ASSEMBLE and EntropyBoost sometimes perform worse than the
supervised boosting algorithms, whereas the proposed SSDirectBoost consistently gives the lowest
error rates in this case.

Please  see  [27]  for  more  technical  detail  and  experimental  results. 

1.3  Direct  Boost  for  Multi-class  Classification 

In multi-class classification, we want to predict the labels of examples lying in some set
$\mathcal{X}$. We are provided a training set of labeled examples
$\mathcal{D} = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where each example $x_i \in \mathcal{X}$ has a
unique label $y_i$ in the set $\{1, \dots, K\}$. Again denote $\mathcal{H} = \{h_1, \dots, h_l\}$ as
the set of all possible weak classifiers that can be produced by the weak learning algorithm, where
a weak classifier $h_j \in \mathcal{H}$ is a mapping from the instance space $\mathcal{X}$ to
$\mathcal{Y} = \{1, \dots, K\}$. Boosting combines weak classifiers to form a highly accurate
combined classifier for multi-class classification by making a prediction according to the
weighted plurality vote of the classifiers:

    $\hat{y} = \arg\max_{y \in \{1, \dots, K\}} f(x, y)$,    (12)

where $f(x, y) = \sum_{h \in \mathcal{H}} \alpha_h 1(h(x) = y)$, $\alpha_h \in \mathbb{R}$. The
empirical error for a multi-class classification problem is given by (2). Our goal is to find
$f = (f(x, y), y \in \mathcal{Y})$ that attains a small empirical error on $\mathcal{D}$ and also
generalizes well. In the following, we describe novel methods that directly minimize (2) and
maximize various margins.
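
The weighted plurality vote in (12) can be illustrated with a short sketch. It assumes the weak classifiers' predictions for a single example are already available as a list; the function name `plurality_vote` and the tie-breaking rule are hypothetical.

```python
from collections import defaultdict

def plurality_vote(weak_predictions, alphas, labels):
    """Predict argmax_y f(x, y), with f(x, y) = sum_h alpha_h * 1(h(x) = y).

    weak_predictions: label h(x) predicted by each weak classifier for one x
    alphas: the corresponding real-valued weights alpha_h
    labels: iterable of candidate labels {1, ..., K}
    """
    scores = defaultdict(float)
    for pred, alpha in zip(weak_predictions, alphas):
        scores[pred] += alpha
    # Ties go to the smallest label (an assumption made for determinism).
    return max(sorted(labels), key=lambda y: scores[y])
```

For instance, with predictions `[1, 2, 2, 3]` and weights `[0.5, 0.3, 0.3, 0.4]`, label 2 collects the largest total weight and is returned.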

1.3.1  Minimizing  0-1  Loss 

Similar to binary classification, we use a greedy coordinate descent algorithm to directly
minimize the empirical error (2) and construct an ensemble classifier. At the $t$th iteration, the
ensemble classifier is $f_t(x, y) = \sum_{k=1}^{t} \alpha_k 1(h_k(x) = y)$,
$\forall y \in \mathcal{Y}$, where the previous $t-1$ weak classifiers $h_k(x)$ and corresponding
weights $\alpha_k$, $k = 1, \dots, t-1$, have been selected and determined. Let
$a(x_i, y) = \sum_{k=1}^{t-1} \alpha_k 1(h_k(x_i) = y)$. We define the inference functions for
example $x_i$ as

    $F_t(x_i, y) = f_t(x_i, y) = a(x_i, y) + \alpha_t 1(h_t(x_i) = y)$,    (13)

which is a linear function of $\alpha_t$ with intercept $a(x_i, y)$ and slope $1(h_t(x_i) = y)$.
The inference function is therefore either a line with slope 1 or a horizontal line. The inference
functions are used to compute the empirical error (2). More specifically, given a weak learner
$h_t \in \mathcal{H}$, for each example pair $(x_i, y_i)$ there are three scenarios for computing
the empirical error; see Figure 5.

Scenario 1 is the case that $h_t(x_i) = y_i$. $F_t(x_i, y_i)$ is a line with slope 1; letting
$l = \arg\max_{y \in \mathcal{Y}, y \neq y_i} a(x_i, y)$, $F_t(x_i, l)$ is a line with slope 0. The
intersection of $F_t(x_i, y_i)$ and $F_t(x_i, l)$ is at $\alpha_t = a(x_i, l) - a(x_i, y_i)$. Thus
when $\alpha_t$ is set on the left side of the intersection point, there is an error for example
$x_i$, and otherwise there is no error.

Scenario 2 is the case that $h_t(x_i) = y$, $y \neq y_i$, and $a(x_i, y_i) > a(x_i, y')$
$\forall y' \in \mathcal{Y}$, $y' \neq y_i$. Then $F_t(x_i, y)$ is a line with slope 1, and
$F_t(x_i, y_i)$ is a line with slope 0. Setting $a(x_i, y) + \alpha_t = a(x_i, y_i)$, the
intersection point of $F_t(x_i, y)$ and $F_t(x_i, y_i)$ is at
$\alpha_t = a(x_i, y_i) - a(x_i, y)$. Thus when $\alpha_t$ is set on the right side of the
intersection point, there is an error for example $x_i$, and otherwise there is no error.

Scenario 3 is the case that $h_t(x_i) = y$, $y \neq y_i$, and $\exists l \in \mathcal{Y}$,
$l \neq y_i$ such that $a(x_i, l) > a(x_i, y_i)$. In this case there is always an error no matter
what value $\alpha_t$ takes.
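
The three scenarios can be turned into a small search over candidate weights. The sketch below is an illustration under stated assumptions, not the report's Algorithm: it restricts the new weight to a hypothetical range `[0, alpha_max]`, counts ties as errors, and represents each example as a tuple `(a, y_true, h_pred)`, where `a` maps every label to its score from the earlier rounds. All function names are hypothetical.

```python
def error_interval(a, y_true, h_pred):
    """Classify one example into the three scenarios for alpha_t >= 0.

    Returns (kind, p): "left" means an error for alpha_t below p,
    "right" means an error for alpha_t above p, and "always" means an
    error for every alpha_t >= 0 (p is None).
    """
    top_wrong = max((y for y in a if y != y_true), key=lambda y: a[y])
    if h_pred == y_true:                         # Scenario 1
        return "left", a[top_wrong] - a[y_true]
    if a[y_true] > a[top_wrong]:                 # Scenario 2
        return "right", a[y_true] - a[h_pred]
    return "always", None                        # Scenario 3


def count_errors(examples, alpha):
    """Number of misclassified examples when the new weight is alpha."""
    errors = 0
    for a, y_true, h_pred in examples:
        scores = dict(a)
        scores[h_pred] += alpha
        best_wrong = max(s for y, s in scores.items() if y != y_true)
        errors += scores[y_true] <= best_wrong   # ties count as errors
    return errors


def best_alpha(examples, alpha_max):
    """Probe the midpoint of every interval between intersection points."""
    points = {0.0, alpha_max}
    for a, y_true, h_pred in examples:
        kind, p = error_interval(a, y_true, h_pred)
        if kind != "always" and 0.0 < p < alpha_max:
            points.add(p)
    pts = sorted(points)
    mids = [(lo + hi) / 2 for lo, hi in zip(pts, pts[1:])]
    return min(mids, key=lambda al: count_errors(examples, al))
```

Because the empirical error is piecewise constant in the weight, probing one point per interval between consecutive intersection points is enough to find a minimizing interval.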


Figure 5: Three scenarios to compute the empirical error of a weak learner $h_t$ over an example
pair $(x_i, y_i)$, where $l$ denotes the incorrect label with the highest score, and $p$ denotes
the intersection point at which the empirical error changes. The red, bold line for each scenario
represents the inference function of example $x_i$ and its true label $y_i$.

1.3.2  Maximizing  Margins

Similar to binary classification, we now describe the algorithm that directly maximizes various
margins for multi-class classification. Define the margin of a labeled example $(x_i, y_i)$ with
respect to an ensemble multi-class classifier
$f_t(x, y) = \sum_{k=1}^{t} \alpha_k 1(h_k(x) = y)$, $\forall y \in \mathcal{Y}$, to be

    $m_i = \frac{f_t(x_i, y_i)}{\sum_{k=1}^{t} |\alpha_k|} - \max_{y \in \mathcal{Y}, y \neq y_i} \frac{f_t(x_i, y)}{\sum_{k=1}^{t} |\alpha_k|}$    (14)



We can sort the margins $m_i$ in increasing order and consider the $n'$ worst training examples
($n' \le n$), that is, those with the smallest margins, and then define the average margin over
those $n'$ labeled examples by $g_{avg\,n'}$. Formally,

    $g_{avg\,n'} = \frac{1}{n'} \sum_{i \in B_{n'}} m_i$    (15)

where $B_{n'}$ denotes the set of $n'$ labeled examples having the smallest margins.
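
The margin in (14) and the bottom-$n'$ average in (15) can be computed as in the following sketch. It assumes each example's ensemble scores are stored in a dict keyed by label and that the sum of absolute weights is passed in as `alpha_sum`; both function names are hypothetical.

```python
def margins(scores, y_true, alpha_sum):
    """Normalized margin of each example: true-label score minus the best
    wrong-label score, divided by the sum of absolute weights."""
    out = []
    for f, y in zip(scores, y_true):
        best_wrong = max(s for label, s in f.items() if label != y)
        out.append((f[y] - best_wrong) / alpha_sum)
    return out


def avg_bottom_margin(ms, n_prime):
    """Average of the n' smallest margins."""
    worst = sorted(ms)[:n_prime]
    return sum(worst) / n_prime
```

With `n_prime = 1` this reduces to the minimum margin, and with `n_prime` equal to the number of examples it is the plain average margin.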

We have designed an algorithm to maximize (15). Given a weak learner $h_t \in \mathcal{H}$ at the
$t$th iteration, let $c = \sum_{k=1}^{t-1} |\alpha_k|$; then the margin on the example $(x_i, y_i)$
can be rewritten as

    $m_i = \frac{a(x_i, y_i) + \alpha_t 1(h_t(x_i) = y_i)}{c + |\alpha_t|} - \max_{y \in \mathcal{Y}, y \neq y_i} \frac{a(x_i, y) + \alpha_t 1(h_t(x_i) = y)}{c + |\alpha_t|}$    (16)



Consider the case that $\alpha_t > 0$. For each example pair $(x_i, y_i)$, there are three
scenarios of (16) to consider, as shown in Figure 6. Scenario 1 is the case that
$h_t(x_i) = y_i$; letting $l = \arg\max_{y \in \mathcal{Y}, y \neq y_i} a(x_i, y)$, we have
$m_i = \frac{a(x_i, y_i) - a(x_i, l) + \alpha_t}{c + \alpha_t}$. This corresponds to the
monotonically increasing curve in Figure 6. Scenario 2 is the case that $h_t(x_i) = l$,
$l \neq y_i$, and $a(x_i, l) \geq a(x_i, y)$, $\forall y \in \mathcal{Y}$, $y \neq y_i$; then
$m_i = \frac{a(x_i, y_i) - a(x_i, l) - \alpha_t}{c + \alpha_t}$. This corresponds to the
monotonically decreasing curve in Figure 6. Scenario 3 is the case that $h_t(x_i) = y$,
$y \neq y_i$, and the highest-scoring incorrect label $l \in \mathcal{Y}$, $l \neq y_i$, satisfies
$a(x_i, l) > a(x_i, y)$; in this case the margin curve of $m_i$ has two pieces. When
$\alpha_t < a(x_i, l) - a(x_i, y)$, $m_i = \frac{a(x_i, y_i) - a(x_i, l)}{c + \alpha_t}$, and when
$\alpha_t \geq a(x_i, l) - a(x_i, y)$,
$m_i = \frac{a(x_i, y_i) - a(x_i, y) - \alpha_t}{c + \alpha_t}$. The scenarios for the case that
$\alpha_t < 0$ can be similarly identified.


Figure 6: Three scenarios of the margin curve of a weak learner $h_t$ over an example pair $(x_i, y_i)$.
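
The margin curve of a single example as a function of the new weight can be evaluated directly from (16). The sketch below assumes a non-negative weight and a dict `a` mapping each label to its score from the earlier rounds; the function name `margin_curve` is hypothetical.

```python
def margin_curve(a, y_true, h_pred, c, alpha):
    """Margin of one example when weak learner h_t (predicting h_pred)
    is added with weight alpha >= 0; c is the sum of earlier |alpha_k|."""
    def score(y):
        # The new weak learner adds alpha only to the label it predicts.
        return a[y] + (alpha if h_pred == y else 0.0)

    best_wrong = max(score(y) for y in a if y != y_true)
    return (score(y_true) - best_wrong) / (c + abs(alpha))
```

Evaluating this function on a grid of weights reproduces the three shapes described above: increasing when the weak learner predicts the true label, decreasing when it predicts the strongest wrong label, and two-piece otherwise.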

Thus in the margin maximization phase, the key step is to find the value of $\alpha_t$ within an
interval $[0, d]$ that maximizes (15) for a given $h_t$. Finding the exact solution is
computationally difficult, since the margin curves of examples in Scenario 3 can intersect with
those of examples in either Scenario 1 or Scenario 2. Fortunately, we can prove that (15) is
quasi-concave, which allows us to design a line search algorithm that maximizes (15) efficiently
by checking the derivative of (15). We have designed a greedy coordinate ascent algorithm that
sequentially maximizes the average margin of the bottom $n'$ examples; it terminates when no weak
classifier $h_t$ increases the average margin over the bottom $n'$ examples. Again, since (15) is
non-differentiable at turning points, the coordinate ascent algorithm may get stuck at a corner
from which it is impossible to make progress along any coordinate direction. To overcome this
difficulty, we use an $\epsilon$-relaxation method, which allows a single coordinate to change even
if this worsens the objective value. When a coordinate is changed, it is set to $\epsilon$ plus (or
$\epsilon$ minus) the value that maximizes the objective function along that coordinate, where
$\epsilon$ is a small positive number. If $\epsilon$ is small enough, the algorithm can eventually
approach a small neighborhood of the optimal solution.
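
As a rough illustration of a one-dimensional search on a quasi-concave objective, the sketch below uses ternary search. This is a swapped-in, derivative-free technique, not the derivative-based line search of the report, and it assumes the objective is strictly unimodal on the interval (flat plateaus can mislead it). The function names, the tolerance, and the clipping of the shifted value to the interval are assumptions.

```python
def ternary_search_max(g, lo, hi, tol=1e-9):
    """Approximate the maximizer of a strictly unimodal g on [lo, hi]."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if g(m1) < g(m2):
            lo = m1   # the maximizer cannot lie in [lo, m1]
        else:
            hi = m2   # the maximizer cannot lie in [m2, hi]
    return (lo + hi) / 2


def epsilon_relaxed_update(g, lo, hi, eps, sign=+1):
    """Coordinate maximizer shifted by +eps or -eps, clipped to [lo, hi].

    The clipping is an assumption added to keep the value in range.
    """
    best = ternary_search_max(g, lo, hi)
    return min(hi, max(lo, best + sign * eps))
```

For example, for `g = lambda a: -(a - 0.3) ** 2` on `[0, 1]`, the search returns a value close to 0.3, and the relaxed update with `eps = 0.01` returns a value close to 0.31.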

1.3.3  Experimental  Results 

To evaluate the performance of the MCDB algorithm, we conduct experiments with 12 UCI datasets.
For comparison, we also report the results of AdaBoost.M1 [6], AdaBoost.MH [16], SAMME [30], and
GD-MCBoost [14]. All of these algorithms use multi-class base classifiers except AdaBoost.MH, which
reduces the multi-class problem to a set of binary classification problems. The classification
error is estimated either by test error or by five-fold cross-validation. Datasets that come with
pre-specified training and testing sets are evaluated by test error, where $n'$ is set to j for
MCDB and the number of rounds is capped at 5000 for each method. Datasets evaluated by
cross-validation are partitioned evenly into five parts. In each fold, we use three parts for
training, one part for validation, and the remaining part for testing. We use the validation data
to choose the optimal model for each algorithm. For AdaBoost.M1, AdaBoost.MH, SAMME, and
GD-MCBoost, the validation data is used to perform early stopping: we run these algorithms for a
maximum of 5000 iterations, and then choose the ensemble classifier from the round with minimum
error on the validation data. For MCDB, the parameter $n'$ is chosen on the values {Ijfgifjf’f’f’T1}
by the validation set. The stopping criterion of MCDB is defined in line 8 of Algorithm 3, where
MCDB terminates at the margin maximization solution; thus we need not apply early stopping. In all
the experiments, the value of $\epsilon$ is set to 0.01 and the value of th is set to 1e-5.
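
The three/one/one use of the five parts can be sketched as below. The report states only the proportions; rotating the validation and test parts across folds, as done here, is an assumption, and the function name `five_fold_splits` is hypothetical.

```python
def five_fold_splits(n_parts=5):
    """Yield (train_parts, val_part, test_part) as indices of the parts.

    Each part serves as the test part exactly once; the next part
    (cyclically) is used for validation and the other three for training.
    """
    for k in range(n_parts):
        test = k
        val = (k + 1) % n_parts
        train = [p for p in range(n_parts) if p not in (test, val)]
        yield train, val, test
```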

An overview of these datasets is shown in Table 4. In the # Examples column, the numbers of
training/test examples are listed for datasets that come with pre-specified training and testing
sets, and the total number of examples is given for the remaining datasets. The original Poker
dataset has 25,010 training examples and 1,000,000 examples for testing. Since the test data is
very large, we randomly divide it into two equal parts and add them to the training and testing
sets respectively, so that the training size becomes 525,010 and the test size becomes 500,000.

Data          # Examples      K    # Variables  Error Estimation
Abalone       4177            28   8            5-CV
Car           1728            4    6            5-CV
Krkopt        28056           18   6            5-CV
Letter        20000           26   16           5-CV
Nursery       12960           5    8            5-CV
Poker525k     525010/500000   10   11           test error
Segmentation  210/2100        7    19           test error
Waveform      5000            3    21           5-CV
Yeast         1484            10   8            5-CV
Glass         214             6    10           5-CV
Wine          178             3    13           5-CV
Vowel         990             11   10           5-CV

Table 4: Description of datasets

First, we restrict the base classifiers to smaller trees to test the performance of each algorithm
when the base classifiers are very weak. We exclude the results of AdaBoost.MH, since all the other
algorithms use multi-class base classifiers and we want to compare the performance of each
algorithm with the same hypothesis space $\mathcal{H}$. Table 5 shows the results of the different
methods when multi-class decision trees with a depth of 3 are used as weak learners. With smaller
trees, MCDB gives the best results on all datasets, indicating that MCDB only requires very weak
base classifiers even though there is no exact weak learner condition for MCDB. GD-MCBoost achieves
the second best accuracy; this algorithm also requires only weak base classifiers, since it is able
to boost any type of weak learner with non-zero directional derivatives [14]. We do not report its
results on the Poker525k dataset, since a single iteration takes more than 12 hours to run with the
authors' MATLAB code. For SAMME, the weak learner conditions can be satisfied easily, but it cannot
drive down the training error when the base classifier is very weak, and its performance is much
worse. AdaBoost.M1 gives the worst results, and it is not able to boost the base classifiers for 4
of the 12 datasets, as shown in Table 5.


Figure 7: Learning curves on training data, test data by Algorithm 1 and 3 respectively. (Legend: MCDB training, MCDB test.)


Data          AdaBoost.M1   SAMME        GD-MCBoost   MCDB
Abalone       -             74.20(1.8)   74.62(1.5)   74.03(2.0)
Car           10.96(2.5)    4.75(1.0)    3.60(1.1)    2.78(0.8)
Krkopt        -             64.33(0.9)   26.55(0.4)   22.76(0.7)
Letter        -             24.94(0.9)   5.40(1.3)    4.89(0.3)
Nursery       9.70(1.5)     3.26(0.7)    0.2(0.0)     0.02(0.0)
Poker525k     49.16         69.09        -            30.09
Segmentation  8.29          6.43         6.0          5.1
Waveform      17.8(1.2)     16.96(1.2)   16.2(1.1)    14.38(1.1)
Yeast         43.65(2.6)    44.73(4.5)   43.6(3.5)    42.43(2.8)
Glass         29.52(10.7)   31.9(8.0)    27.0(7.4)    26.19(10.8)
Wine          8.57(4.9)     7.43(4.8)    7.54(5.3)    3.43(4.7)
Vowel         -             19.19(2.6)   9.2(2.6)     5.66(1.9)

Table  5:  Test  error  (and  standard  deviation)  of  multi-class  boosting  methods  on  UCI  datasets,  using 
decision  trees  with  a  depth  of  3. 

With the same hypothesis space $\mathcal{H}$ (trees with a depth of 3), Algorithm 1 usually
achieves a lower training classification error rate. The left panel of Figure 7 shows a typical
training error curve on the Car dataset, and the middle panel shows the corresponding test error
curve. Once Algorithm 1 terminates at a coordinatewise local minimum, Algorithm 3 can still drive
down the test error even when the training error does not decrease, as shown in the right panel of
Figure 7.

We next investigate how these algorithms perform with more powerful base classifiers. We tried all
tree depths in the candidate set {3, 5, 8, 12} for each dataset. Since this time the algorithms are
not restricted to the same hypothesis space $\mathcal{H}$, we also include AdaBoost.MH in the
comparison. As shown in Table 6, among all the methods, MCDB gives the most accurate results on 9
of the 12 datasets, and its results are close to the best results produced by the other methods on
the remaining 3 datasets.

Please  see  [28]  for  more  technical  detail  and  experimental  results. 

1.4  Direct  Optimization  for  Ranking 

First we describe DirectRank, which learns a linear ranking function by directly optimizing any
ranking measure. Suppose that a set of training queries $Q_s = \{q_1, q_2, \dots, q_n\}$ is given,
and a set of documents $d_i = \{d_{i1}, d_{i2}, \dots, d_{i,m(q_i)}\}$ is retrieved for each query
$q_i$. Let $m(q_i)$ denote the size of the set of retrieved documents, which varies across queries.
Every document $d_{ij}$ is associated with a manually labeled judgment
$y_{ij} \in \{r_1, r_2, \dots, r_q\}$ that denotes the relevance of the document to the query. We
define the order $r_q \succ r_{q-1} \succ \dots \succ r_1$, where $\succ$ denotes the preference
relationship. An L-dimensional feature vector is created for each query-document pair
$(q_i, d_{ij})$, $i = 1, \dots, n$, $j = 1, \dots, m(q_i)$, and is denoted as
$g(d_{ij}|q_i) = (g_1(d_{ij}|q_i), \dots, g_L(d_{ij}|q_i))$.

Data          AdaBoost.M1  AdaBoost.MH  SAMME        GD-MCBoost   MCDB
Abalone       76.41(1.4)   75.33(1.2)   73.70(1.7)   74.62(1.5)   73.44(1.8)
Car           3.36(0.8)    2.84(0.6)    3.65(0.9)    2.8(0.8)     2.67(0.8)
Krkopt        14.3(0.3)    11.68(0.3)   12.71(0.2)   12.20(0.3)   11.04(0.2)
Letter        3.48(0.3)    3.1(0.1)     4.88(0.3)    3.37(0.2)    3.1(0.2)
Nursery       0.12(0.1)    0.03(0.0)    0.16(0.1)    0.0(0.0)     0.0(0.0)
Poker525k     30.19        2.01         18.74        -            2.77
Segmentation  4.86         6.14         5.1          6.0          4.52
Waveform      15.2(1.4)    14.56(1.4)   15.08(1.0)   15.2(0.8)    14.26(1.1)
Yeast         41.69(1.8)   41.82(2.1)   41.22(3.1)   43.2(3.6)    40.23(2.5)
Glass         27.14(9.3)   29.52(9.2)   24.76(8.7)   24.0(6.8)    24.76(9.9)
Wine          8.57(4.9)    9.16(5.3)    7.43(4.8)    7.54(5.3)    3.43(4.7)
Vowel         5.96(2.9)    7.68(1.8)    6.25(2.3)    5.6(3.0)     5.66(1.9)

Table 6: Test error (and standard deviation) of multi-class boosting methods on UCI datasets, using
decision trees with a maximum depth of 12.

The objective of ranking is to construct a ranking function $f$ such that, for each query, the
retrieved documents can be assigned ranking scores using the function and then be ranked according
to the scores. The learning process then amounts to optimizing the ranking measure, which
represents the agreement between the permutation given by the relevance judgments and the ranking
yielded by the ranking function. We use the linear ranking function

    $f(g(d_{ij}|q_i)) = \alpha \cdot g(d_{ij}|q_i)$    (17)

where the weight vector $\alpha = (\alpha_1, \alpha_2, \dots, \alpha_L)$ is the model parameter.
Assume the ranking measure is NDCG. Since NDCG is non-convex, non-differentiable and discontinuous
with respect to $\alpha$, we cannot use standard optimization algorithms such as gradient ascent to
optimize it directly.
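
To make the discontinuity concrete, the sketch below computes NDCG truncated at level r for the ranking induced by a list of scores. The report does not restate the NDCG formula; the gain $2^{rel} - 1$ and the $\log_2$ position discount used here are a common convention and are an assumption, as are the function names.

```python
import math

def dcg_at_r(relevances, r):
    """DCG of an already ordered relevance list, truncated at r
    (assumed gain 2^rel - 1 and log2 position discount)."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:r]))


def ndcg_at_r(scores, relevances, r):
    """NDCG@r of the ranking obtained by sorting documents by score."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranked = [relevances[j] for j in order]
    ideal = dcg_at_r(sorted(relevances, reverse=True), r)
    return dcg_at_r(ranked, r) / ideal if ideal > 0 else 0.0
```

The value depends on the scores only through their order, so it stays constant under small changes of the weights and jumps when two documents swap positions.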

DirectRank is an iterative coordinate ascent method that directly optimizes NDCG. In each
iteration, only one coordinate parameter, denoted $\alpha_k$, is updated while the others are kept
unchanged. The rationale is that the ranking function can then be written as a one-dimensional
linear function,

    $f(g(d_{ij}|q_i)) = \alpha_k \cdot g_k(d_{ij}|q_i) + \sum_{l \neq k}^{L} \alpha_l g_l(d_{ij}|q_i)$

Since $g_k(d_{ij}|q_i)$ is constant with respect to $\alpha_k$, and so is the second term, we can
rewrite these two quantities as $b_{ij}$ and $a_{ij}$, and convert the equation above to

    $f(g(d_{ij}|q_i)) = b_{ij} \cdot \alpha_k + a_{ij}$    (18)

Note that for each document $d_{ij}$ retrieved by each query $q_i$, there is a linear function of
$\alpha_k$. Given an input $\alpha_k$, each document gets an output score from this linear
function. The order of these scores determines the order of the documents, which in turn
determines the NDCG value.

Figure 8: The top-r (r=3) candidates for each of the two queries are marked in bold. Between each
two boundaries, the NDCG value is shown. We can see the intervals between p^ and p§ achieve the
best NDCG.
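
The reduction to the one-dimensional form (18) can be sketched as follows. The code assumes a plain list representation of the weight vector and of one document's feature vector; the function name `line_coefficients` and the names `slope` and `intercept` are hypothetical.

```python
def line_coefficients(alpha, g, k):
    """Return (slope, intercept) such that the linear ranking score equals
    slope * alpha[k] + intercept for one feature vector g.

    slope is the k-th feature value; intercept collects the contribution
    of all other coordinates, which stay fixed during the update.
    """
    slope = g[k]
    intercept = sum(a_l * g_l
                    for l, (a_l, g_l) in enumerate(zip(alpha, g))
                    if l != k)
    return slope, intercept
```

For `alpha = [0.5, 2.0, -1.0]`, `g = [1.0, 3.0, 2.0]` and `k = 1`, the slope is 3.0 and the intercept is -1.5, so the score at the current weight is 3.0 * 2.0 - 1.5 = 4.5, which matches the full dot product.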

This is illustrated in Figure 8, where each line represents the scoring function of a document. At
any value of $\alpha_k$, the ranking of the linear function outputs is equivalent to the ranking of
the documents. Note that a small change of $\alpha_k$ cannot cause a jump in the NDCG value unless
it changes the order of the top-r documents. Such a change in order happens only at a point where
two lines intersect. We call these points jumping points. In Figure 8, we draw ten lines
corresponding to ten documents that belong to two queries, with the top-3 ranked documents
considered for each query; $(p_1, p_2, \dots, p_{12})$ are the jumping points.

Theoretically, we can search all the intersections to acquire all possible snapshots of ranked
documents. Because any two non-parallel lines form an intersection, the total number of
intersections is on the order of $m(q_i)^2$. When $m(q_i)$ is large, this effort is not only
time-consuming but also unnecessary: in real-world applications we are merely interested in the
rank of the top-r candidates, and the NDCG metric is always truncated at a certain level r, usually
$r \le 10$. As a result, the number of jumping points among the top-r candidates is quite limited,
and does not increase linearly as the number of documents increases. Therefore, we can efficiently
find all the jumping points on one coordinate and find the interval that achieves the optimal NDCG
value. We then determine the optimal parameter value by maximizing the likelihood of the top-r
ranked documents for a ranking by human judgment within these intervals.
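
The interval search along one coordinate can be illustrated with a brute-force sketch for a single query. It enumerates all pairwise intersections rather than only those involving top-r candidates, so it shows the idea but not the efficient procedure described above. The NDCG form (gain $2^{rel} - 1$, $\log_2$ discount) and all function names are assumptions.

```python
import math
from itertools import combinations

def ndcg_at_r(scores, rels, r):
    """NDCG@r with gain 2^rel - 1 and log2 discount (assumed form)."""
    def dcg(ordered):
        return sum((2 ** rel - 1) / math.log2(pos + 2)
                   for pos, rel in enumerate(ordered[:r]))
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ideal = dcg(sorted(rels, reverse=True))
    return dcg([rels[j] for j in order]) / ideal if ideal > 0 else 0.0


def jumping_points(lines):
    """Sorted intersections of all non-parallel (slope, intercept) pairs."""
    pts = set()
    for (b1, a1), (b2, a2) in combinations(lines, 2):
        if b1 != b2:
            pts.add((a2 - a1) / (b1 - b2))
    return sorted(pts)


def best_interval(lines, rels, r):
    """Return ((low, high), ndcg) for the interval with the highest NDCG@r."""
    pts = jumping_points(lines)
    bounds = [-math.inf] + pts + [math.inf]
    if pts:
        # One probe strictly inside each interval between jumping points.
        probes = ([pts[0] - 1.0]
                  + [(p + q) / 2 for p, q in zip(pts, pts[1:])]
                  + [pts[-1] + 1.0])
    else:
        probes = [0.0]
    best = None
    for i, x in enumerate(probes):
        value = ndcg_at_r([b * x + a for b, a in lines], rels, r)
        if best is None or value > best[1]:
            best = ((bounds[i], bounds[i + 1]), value)
    return best
```

Since the document order is fixed between consecutive jumping points, a single probe per interval suffices to evaluate NDCG there.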

The decision tree representation of the features of a query-document pair is capable of capturing
more complex relations between the original features, and has been used by LambdaMART [22] to
significantly improve ranking performance over LambdaRank. We therefore integrate regression trees
into DirectRank by following a stage-wise strategy. We use MART trees [10] as our weak learners,
and in order to enhance stability, we restrict the new weight $\alpha$ to the range $[a, b]$,
where the hyper-parameters are set empirically within [0.1, 0.5] in our experiments. If the output
is beyond the range, we simply take the border value.

1.4.1  Experimental  Results 

We have applied DirectRank to two large datasets, the Yahoo Challenge Track 1 data and the
Microsoft 30K web data, where it achieved the best results among the methods compared. For the
Yahoo Challenge Track 1 dataset, since DirectRank uses a linear function, for a fair comparison we
compare it with LambdaRank [3], whose ranking function is also linear and which has the best
reported result. Table 7 shows that DirectRank (DR) outperforms LambdaRank (LR). We also compare
DirectRank with other baselines, namely SmoothGrad (SG) [11], AdaRank (AR) [24], ad hoc coordinate
ascent (CA) [13], RankBoost (RB) [8] and ListNet (LN) [4]. Table 8 shows the running times of these
methods. Given a randomly generated starting point, DirectRank converges after approximately 20
rounds and takes a total time of 2.3 hours. SmoothGrad is the fastest; however, it does not perform
as well as DirectRank.


       DR     LR     SG     AR     CA     RB     LN
TRAIN  0.762  -      0.741  0.728  0.750  0.734  0.709
VALID  0.757  -      0.738  0.723  0.744  0.730  0.700
TEST   0.760  0.757  0.739  0.729  0.745  0.732  0.705

Table 7: NDCG@10 on Yahoo Challenge Track 1 dataset.

       DR   SG   AR    CA    RB    LN
Hours  2.3  0.3  11.8  45.3  24.5  23.8

Table 8: Running time on Yahoo Challenge Track 1 dataset.

As tree-based models generally outperform linear models, we compare our system with two
state-of-the-art systems, MART [10] and LambdaMART, on two large datasets. The maximum number of
trees is set to 1000. On the Yahoo data, the number of leaf nodes is set to 10, since more leaves
do not contribute significantly to the final performance with respect to the official measure
NDCG@10. On the Microsoft 30K web data, we vary the number of leaf nodes among 10, 30 and 50.

            @1     @10
MT          .7084  .7768
LM          .7167  .7791
DirectRank  .7199  .7810

Table 9: NDCG scores of tree models on Yahoo Challenge, including MART (MT), LambdaMART (LM).

        @1                    @10
Leaves  MT     LM     DR      MT     LM     DR
10      .4582  .4602  .4894   .4887  .4943  .4985
30      .4823  .4830  .4917   .4994  .4997  .5055
50      .4744  .4883  .4911   .5022  .5006  .5061

Table 10: NDCG score of tree models on Microsoft 30K web data with varying number of leaf nodes,
including MART (MT), LambdaMART (LM), DirectRank (DR).


DirectRank shows significant superiority in NDCG@1 on the Microsoft 30K web data, especially when
the number of leaf nodes is small, and in the other cases (Tables 9 and 10) DirectRank still
performs slightly better. The average number of documents per query differs greatly between the
two datasets: about 23 in the Yahoo dataset and 72 in the Microsoft data. Since MART treats all
documents equally, more documents may in some sense have a negative influence on the objective
NDCG@1; thus, an improvement is more likely to be obtained by adopting an accurate objective. When
the number of leaf nodes increases, the two baselines improve significantly in NDCG@1, while
DirectRank is more stable and effective in performance. Moreover, even for a small number of leaf
nodes, DirectRank works very well. Finally, MART in [21] achieves higher performance by using
complete binary trees with different depths, whereas all tree-based algorithms here are implemented
in a fair manner by restricting the maximum number of leaf nodes.

Please see [19] for more technical detail and experimental results.

Appendix 1: Executive Summary of Lili Guo's Master Thesis

The main challenge in learning-to-rank for information retrieval is the difficulty of directly
optimizing ranking measures to automatically construct a ranking model from training data. This is
mainly due to the fact that ranking measures are determined by the order of the ranked documents
rather than by the specific values of the ranking model scores; thus they are non-convex,
non-differentiable and discontinuous. To address this issue, listwise approaches have been proposed
in which loss functions are defined either by exploiting a probabilistic model or by optimizing
upper bounds or smoothed approximations of ranking measures. Even though very promising results
have been achieved, there is still a mismatch between the target cost and the optimization cost. In
this work, we present a novel learning algorithm that directly optimizes the ranking measures
without resorting to any upper bounds or approximations. Our approach is essentially an iterative
greedy coordinate descent method. In each iteration, we update only one parameter along one
coordinate with all others fixed. Since the ranking measure is a stepwise function of a single
parameter, we use an exhaustive line search algorithm to locate the interval with the smallest
ranking measure along each coordinate. We pick the coordinate that leads to the largest reduction
of the ranking measure. In order to determine the optimal value of the parameter for the selected
coordinate, we construct a probabilistic framework for the permutation and maximize the likelihood
of the top-m ranked documents. This iterative procedure is continued until convergence. We conduct
experiments on five datasets selected from the Microsoft LETOR datasets; our experimental results
show that the proposed direct rank algorithm outperforms several well-known state-of-the-art
ranking algorithms.